Introduction

Dataset_1: Total Immigration for 2019

The United Nations, Department of Economic and Social Affairs published dataset presents estimates of international migrant by age, sex and origin. Estimates are presented for 1990, 1995, 2000, 2005, 2010, 2015 and 2019 and are available for all countries and areas of the world. The estimates are based on official statistics on the foreign-born or the foreign population.

Read the data into R

Remove Regions and keep only Countries

First thing you would notice is that the first 8 rows are not countries, they are regions. This time we want to see how people are migrating from countries to countries so we can remove these rows for the regions. When you look at ‘X6’ column, it looks that those ‘region’ rows don’t have any value there.

By running the command like below to keep only the rows whose X6 column have NA

These are all regions, not countries, which means that we can safely remove these rows by adding an exclamation mark ‘!’ right before ‘is.na()’ function like below.

Now we got a dataframe of 1,624 rows with 246 columns as return with all countries only data for each year.

Remove unnecessary columns

When you look at the columns we would notice that there are unnecessary columns like ‘Total’, ‘Other South’, etc, because we are interested in estimates of the migrants only for countries to countries. We can remove those unnecessary columns with ‘select’ command along with other unnecessary columns like below.

I’m using minus ‘-’ to delete columns and using ‘start_with’ function inside ‘select’ command to delete multiple columns whose names matche the text pattern of “Other”.

Rename Columns

Gather 235 columns to make it tidy

Now, it’s ready to tidy this ‘matrix’-ish data form by using ‘gather’ command from tidyr package.

Perform Statistics

Here we want to know which countries has the highest number of migrant by year . To do so, I would use

## Classes 'tbl_df', 'tbl' and 'data.frame':    78754 obs. of  4 variables:
##  $ year               : num  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ destination_country: chr  "Namibia" "South Africa" "Egypt" "Libya" ...
##  $ origin_country     : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ migrants           : num  64 59 237 677 254 ...
##   ..- attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame':  345 obs. of  4 variables:
##   .. ..$ row     : int  496 546 597 648 1397 1419 1589 1644 1938 1956 ...
##   .. ..$ col     : int  NA NA NA NA NA NA NA NA NA NA ...
##   .. ..$ expected: chr  "a number" "a number" "a number" "a number" ...
##   .. ..$ actual  : chr  "-" "-" "-" "-" ...

As we can see here, we got more than 78,700 colums. I would prefer to take a subset of the data based on some statistical analysis. To do so, we can get mean, median to set a filtering cretria. Additionally, I will add a ranking column to the dataframe to rank the countries

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##        0       39      280    17319     2342 12168662      345

Now we get a much tider data about 7,000 columns, only 4 columns. However, we can use spread() function to group by year. to see

Export to csv file

Test that the file is already created.

Please be sure that you set working directory in Rstudio to the current working directory.

## [1] TRUE

open the file from local machine

Visualization

I would prefer to explore the data quickly using the Heatmap chart. This can be done by assigning ‘destination_country’ to X-axis, ‘origin_country’ to Y-axis, and ‘migrants’ to Color.

Note This group is for countries that have total number of migrants above average ~ 13000 migrant per year. First step, we can plot a boxplot to explore the data

## [1] "Afghanistan" "Albania"     "Algeria"     "Andorra"     "Angola"     
## [6] "Argentina"

As we can see from the plot, for the top 100 destination countries that have most of the migrants populations over seven period of time. We will get more insights on 2015 and 2019 and inspect what is the increasing rate for the top 10 migration countries.

We got some interesting results here, for instance, United states of America has the highest migrants population over the three time intervals. However, the total migrants that USA accepts decreased by 1 million migrants in 2019. Turkey started accepting migrants in 2019 time interval with a significant number more than 3.5M migrants per interval. As an overall observation, more countries accepting migrants throught time where 2019 interval has the the top 6 countries accepting population.

For the next steps, I am interested in investigating what is the migrants’ origin countries that migrate to the United States and Turkey.

##  Named list()
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi FALSE
##  - attr(*, "validate")= logi TRUE

The majority of migrant population to the United States of America is from Mexico. Also, it seams that the Turkey accepted more than 3.7M migrant from Syrian Arab Republic in 2019 interval. The other main origin countries that have most of the population are India, Ukrain, and Russian Federation respectively with no major difference in the total population.

As illustrated, we can concluded that United States is the top destination for Mexicans. However, Indians like to live in the United Arab Emirates at the same time Syrians would rather to flee to Turkey

Dataset_2: UN_Migrant stock by age

Import the dataset

As the previous dataset, we need to extract only the countries not regions. To do so, we can

We now have 1,624 row with 62 columns. Now it is time to do some cleaning up.

Remove unnecessary columns

When you look at the columns we would notice that there are unnecessary columns like ‘Total’, etc, because we are interested in estimates of the migrants age category for males and female only for countries. We can remove those unnecessary columns with ‘select’ command along with other unnecessary columns like below.

Now we have 1624 rows with a 34 columns ### Subset the data - Age categories for both sexes

Since we are focusing here on what is the most age categories that had migrated from their origin countries to another countries. So I decided to subset the data into three subsets, both categories, males only, and females only. We are interested here in both categories.

Rename the columns

Now we have 198 columns, we can make it more tider by devide the age categories into 4 familiar categories.

child 0 - 19
youth 20 - 39
———- ———
adult 40 - 59
———- ———
senior 60 - 75+
———- ———

we will iterate over the subset and sum each 4 column into one, but first, we need to convert the column type from char to num to be able to complete the summation.

Test that the file is already created.

Please be sure that you set working directory in Rstudio to the current working directory.

## [1] TRUE

open the file from local machine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    1227    6068   55709   30883 5485048     480

I used dplyr::leftjoin -> Reduce function along with merge to combine all 4 dataframes into one.

Now based on the previous analysis in Dataset_1, we will pick only the top countries that have majority of population to get more insights how many migrants for each age category. I will look into the top 6 destinations USA, Turkey, United Arab Emirates, India, Ukraine, and Russian Federation.

First, we need to condense the age groups into categorial factor to be easy to visualize. To do that, we can construct a for-loop to iterate over the merged_age dataframe and change the value into the category we want child, youth, adult, and senior migrants.

Dataset_3: UN_Migrant_refugee stock total 2019

Import the dataset

Remove unnecessary columns

Now we have 232 rows with a 22 columns

Subset the data - Estimate refugee stock including asylum seekers

Rename the columns

Test that the file is already created.

Please be sure that you set working directory in Rstudio to the current working directory.

## [1] TRUE

open the file from local machine